8 research outputs found

    Misspecification in Inverse Reinforcement Learning

    Full text link
    The aim of Inverse Reinforcement Learning (IRL) is to infer a reward function RR from a policy π\pi. To do this, we need a model of how π\pi relates to RR. In the current literature, the most common models are optimality, Boltzmann rationality, and causal entropy maximisation. One of the primary motivations behind IRL is to infer human preferences from human behaviour. However, the true relationship between human preferences and human behaviour is much more complex than any of the models currently used in IRL. This means that they are misspecified, which raises the worry that they might lead to unsound inferences if applied to real-world data. In this paper, we provide a mathematical analysis of how robust different IRL models are to misspecification, and answer precisely how the demonstrator policy may differ from each of the standard models before that model leads to faulty inferences about the reward function RR. We also introduce a framework for reasoning about misspecification in IRL, together with formal tools that can be used to easily derive the misspecification robustness of new IRL models

    Lexicographic Multi-Objective Reinforcement Learning

    Full text link
    In this work we introduce reinforcement learning techniques for solving lexicographic multi-objective problems. These are problems that involve multiple reward signals, and where the goal is to learn a policy that maximises the first reward signal, and subject to this constraint also maximises the second reward signal, and so on. We present a family of both action-value and policy gradient algorithms that can be used to solve such problems, and prove that they converge to policies that are lexicographically optimal. We evaluate the scalability and performance of these algorithms empirically, demonstrating their practical applicability. As a more specific application, we show how our algorithms can be used to impose safety constraints on the behaviour of an agent, and compare their performance in this context with that of other constrained reinforcement learning algorithms

    Is SGD a Bayesian sampler? Well, almost

    Full text link
    Overparameterised deep neural networks (DNNs) are highly expressive and so can, in principle, generate almost any function that fits a training dataset with zero error. The vast majority of these functions will perform poorly on unseen data, and yet in practice DNNs often generalise remarkably well. This success suggests that a trained DNN must have a strong inductive bias towards functions with low generalisation error. Here we empirically investigate this inductive bias by calculating, for a range of architectures and datasets, the probability PSGD(f∣S)P_{SGD}(f\mid S) that an overparameterised DNN, trained with stochastic gradient descent (SGD) or one of its variants, converges on a function ff consistent with a training set SS. We also use Gaussian processes to estimate the Bayesian posterior probability PB(f∣S)P_B(f\mid S) that the DNN expresses ff upon random sampling of its parameters, conditioned on SS. Our main findings are that PSGD(f∣S)P_{SGD}(f\mid S) correlates remarkably well with PB(f∣S)P_B(f\mid S) and that PB(f∣S)P_B(f\mid S) is strongly biased towards low-error and low complexity functions. These results imply that strong inductive bias in the parameter-function map (which determines PB(f∣S)P_B(f\mid S)), rather than a special property of SGD, is the primary explanation for why DNNs generalise so well in the overparameterised regime. While our results suggest that the Bayesian posterior PB(f∣S)P_B(f\mid S) is the first order determinant of PSGD(f∣S)P_{SGD}(f\mid S), there remain second order differences that are sensitive to hyperparameter tuning. A function probability picture, based on PSGD(f∣S)P_{SGD}(f\mid S) and/or PB(f∣S)P_B(f\mid S), can shed new light on the way that variations in architecture or hyperparameter settings such as batch size, learning rate, and optimiser choice, affect DNN performance

    On The Expressivity of Objective-Specification Formalisms in Reinforcement Learning

    Full text link
    To solve a task with reinforcement learning (RL), it is necessary to formally specify the goal of that task. Although most RL algorithms require that the goal is formalised as a Markovian reward function, alternatives have been developed (such as Linear Temporal Logic and Multi-Objective Reinforcement Learning). Moreover, it is well known that some of these formalisms are able to express certain tasks that other formalisms cannot express. However, there has not yet been any thorough analysis of how these formalisms relate to each other in terms of expressivity. In this work, we fill this gap in the existing literature by providing a comprehensive comparison of the expressivities of 17 objective-specification formalisms in RL. We place these formalisms in a preorder based on their expressive power, and present this preorder as a Hasse diagram. We find a variety of limitations for the different formalisms, and that no formalism is both dominantly expressive and straightforward to optimise with current techniques. For example, we prove that each of Regularised RL, Outer Nonlinear Markov Rewards, Reward Machines, Linear Temporal Logic, and Limit Average Rewards can express an objective that the others cannot. Our findings have implications for both policy optimisation and reward learning. Firstly, we identify expressivity limitations which are important to consider when specifying objectives in practice. Secondly, our results highlight the need for future research which adapts reward learning to work with a variety of formalisms, since many existing reward learning methods implicitly assume that desired objectives can be expressed with Markovian rewards. Our work contributes towards a more cohesive understanding of the costs and benefits of different RL objective-specification formalisms

    Goodhart's Law in Reinforcement Learning

    Full text link
    Implementing a reward function that perfectly captures a complex task in the real world is impractical. As a result, it is often appropriate to think of the reward function as a proxy for the true objective rather than as its definition. We study this phenomenon through the lens of Goodhart's law, which predicts that increasing optimisation of an imperfect proxy beyond some critical point decreases performance on the true objective. First, we propose a way to quantify the magnitude of this effect and show empirically that optimising an imperfect proxy reward often leads to the behaviour predicted by Goodhart's law for a wide range of environments and reward functions. We then provide a geometric explanation for why Goodhart's law occurs in Markov decision processes. We use these theoretical insights to propose an optimal early stopping method that provably avoids the aforementioned pitfall and derive theoretical regret bounds for this method. Moreover, we derive a training method that maximises worst-case reward, for the setting where there is uncertainty about the true reward function. Finally, we evaluate our early stopping method experimentally. Our results support a foundation for a theoretically-principled study of reinforcement learning under reward misspecification

    Neural networks are a priori biased towards Boolean functions with low entropy

    Full text link
    Understanding the inductive bias of neural networks is critical to explaining their ability to generalise. Here, for one of the simplest neural networks -- a single-layer perceptron with n input neurons, one output neuron, and no threshold bias term -- we prove that upon random initialisation of weights, the a priori probability P(t) that it represents a Boolean function that classifies t points in {0,1}^n as 1 has a remarkably simple form: P(t) = 2^{-n} for 0\leq t < 2^n. Since a perceptron can express far fewer Boolean functions with small or large values of t (low entropy) than with intermediate values of t (high entropy) there is, on average, a strong intrinsic a-priori bias towards individual functions with low entropy. Furthermore, within a class of functions with fixed t, we often observe a further intrinsic bias towards functions of lower complexity. Finally, we prove that, regardless of the distribution of inputs, the bias towards low entropy becomes monotonically stronger upon adding ReLU layers, and empirically show that increasing the variance of the bias term has a similar effect
    corecore